A Probabilistic Algorithm for Segmenting Non-Kanji Japanese Strings

Authors

Virginia Teller

Eleanor Olds Batchelder

Abstract

We present an algorithm for segmenting unrestricted Japanese text that is able to detect up to 98% of the words in a corpus. The segmentation technique, which is simple and extremely fast, does not depend on a lexicon or any formal notion of what a word is in Japanese, and the training procedure does not require annotated text of any kind. Relying almost exclusively on character type information and a table of hiragana bigram frequencies, the algorithm decides whether or not to create a word boundary. This method divides strings of Japanese characters into units that are computationally tractable and that can also be justified on lexical and syntactic grounds. Part-of-speech taggers have been used to obtain information about the lexical, syntactic, and some semantic properties of large corpora. Automatic text tagging is an important first step in discovering the linguistic structure of large text corpora.
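The abstract names two sources of evidence: character type and a table of hiragana bigram frequencies. The Python sketch below illustrates how such a segmenter might be wired together. It is not the authors' procedure. The function names `char_type` and `segment`, the toy counts in `TOY_BIGRAM_COUNTS`, the `min_count` threshold, and the rule "split a hiragana pair when its bigram is rare" are all assumptions made for illustration, as is the choice to always place a boundary where the character type changes.

```python
def char_type(ch: str) -> str:
    """Classify a character by Unicode block (a coarse approximation)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


# Hypothetical counts for illustration only; a real table would be
# collected from an unannotated corpus.
TOY_BIGRAM_COUNTS = {"です": 50, "した": 40, "まし": 30}


def segment(text, bigram_counts, min_count=5):
    """Split text into segments using character types and hiragana bigrams.

    Assumed rules (not taken from the paper):
    - a change of character type always creates a boundary;
    - inside a hiragana run, a boundary is created when the bigram
      formed by the two adjacent characters has a count below min_count.
    """
    if not text:
        return []
    segments = [text[0]]
    for prev, cur in zip(text, text[1:]):
        prev_type, cur_type = char_type(prev), char_type(cur)
        if prev_type == cur_type == "hiragana":
            boundary = bigram_counts.get(prev + cur, 0) < min_count
        else:
            boundary = prev_type != cur_type
        if boundary:
            segments.append(cur)
        else:
            segments[-1] += cur
    return segments


print(segment("私は学生です", TOY_BIGRAM_COUNTS))
```

With the toy table, the kanji pair 学生 stays together because the character type does not change, and です stays together because its bigram count exceeds the threshold. A frequency threshold is only one way to use such a table; the abstract does not say how the frequencies enter the boundary decision.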

Similar resources

Segmenting Sentences into Linky Strings Using D-bigram Statistics

It is obvious that segmentation plays an important role in natural language processing (NLP), especially for languages whose sentences are not easily separated into morphemes. In this study we propose a method of segmenting a sentence. The system described in this paper does not use any grammatical information or knowledge in processing. Instead, it uses statistical information drawn from n...

Acquired Dyslexia in Japanese : Implications for Reading Theory

Acquired dyslexia research has been conducted mainly on English neurological patients. A limited number of dyslexia studies on non-alphabetic orthographies are available. Classical case studies for acquired dyslexia in Japanese, which has two distinctive scripts (morphographic Kanji and phonographic Kana), reported 'script-dependent' dyslexia patterns. Although recent case studies showed 'scrip...

TR99-1756 Unsupervised Statistical Segmentation of Japanese Kanji Strings

Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method to segment Japanese kanji sequences into words based solely on character n-gram counts from an unannotated corpus. The performance was often better than that of rule-based morphological analyzers over a variety of ...

Kana-Kanji Conversion System with Input Support Based on Prediction

1 Introduction. TOSHIBA developed the world's first Japanese word processor in 1978. Unlike languages based on an alphabet, Japanese uses thousands of kanji characters of varying complexity. Hence, to arrange all of the kanji characters on a keyboard is difficult. On the other hand, kana characters, which are the phonetic scripts of Japanese, have 83 variations; these can be arranged o...

Normal and impaired reading of Japanese kanji and kana

Two kinds of scripts are used in the written forms of Japanese words: morphographic kanji and phonographic kana. Whereas each kana character invariably represents a single pronunciation, the majority of kanji characters have two or more legitimate pronunciations, with one appropriate to the character in any given word. Furthermore, each kanji character has meaning while a kana character does no...

Publication date: 1994
